$$\operatorname{var}(Y_g) = \mu_g + \alpha \mu_g^2 \tag{5.19}$$
In the quasi-negative binomial distribution, the variance is modeled as follows:
$$\operatorname{var}(Y_g) = \sigma_g^2 \left( \mu_g + \theta \mu_g^2 \right) \tag{5.20}$$
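To make the difference between Equations 5.19 and 5.20 concrete, the following minimal Python sketch evaluates both variance functions for a single gene. The function names and the numeric values of the mean, the dispersion, and the quasi-dispersion are illustrative assumptions and do not come from any dataset.

```python
def nb_variance(mu, alpha):
    """Negative binomial variance (Equation 5.19): mu + alpha * mu^2."""
    return mu + alpha * mu ** 2


def quasi_nb_variance(mu, theta, sigma2):
    """Quasi-negative binomial variance (Equation 5.20): sigma2 * (mu + theta * mu^2)."""
    return sigma2 * (mu + theta * mu ** 2)


# Hypothetical values chosen only for illustration.
mu = 100.0          # mean count of gene g
dispersion = 0.25   # plays the role of alpha in (5.19) and theta in (5.20)
sigma2 = 2.0        # gene-specific quasi-dispersion sigma_g^2

print("NB variance:      ", nb_variance(mu, dispersion))
print("quasi-NB variance:", quasi_nb_variance(mu, dispersion, sigma2))
```

In this sketch, setting the dispersion to zero reduces Equation 5.19 to the Poisson variance (the variance equals the mean), and setting the quasi-dispersion to one makes Equation 5.20 coincide with Equation 5.19.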
An RNA-Seq study design may include one or several conditions, called factors, and a
researcher is usually interested in testing the effect of a condition. For instance,
assume that a researcher wants to study breast cancer in women. She conducted an
RNA-Seq study on samples from healthy and cancer tissues of five affected women. The
analysis programs require a matrix that describes the design, called a design matrix. The
design matrix defines the model (the structure of the relationship between genes and
explanatory variables), and it is also used to store the values of the explanatory
variables [32]. The design matrix is created from the study metadata, as shown in Table 5.1.
The design matrix includes dummy variables that set the level of each factor to either
zero or one, as we will see soon.
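As a rough illustration of how dummy variables encode the factors, the sketch below builds a design matrix in plain Python from the sample metadata of Table 5.1. The helper name `build_design_matrix`, the column names, and the choice of `Norm` and `Rep1` as reference levels (so-called treatment coding) are assumptions made for this example; analysis programs may order or name the columns differently.

```python
# Sample metadata transcribed from Table 5.1: (SampleID, Condition, Patient).
samples = [
    ("norm_rep1", "Norm", "Rep1"),
    ("norm_rep2", "Norm", "Rep2"),
    ("norm_rep3", "Norm", "Rep3"),
    ("tumo_rep1", "Tumo", "Rep1"),
    ("tumo_rep2", "Tumo", "Rep2"),
    ("tumo_rep3", "Tumo", "Rep3"),
]


def build_design_matrix(samples, ref_condition="Norm", ref_patient="Rep1"):
    """Return (column_names, rows) for an additive Patient + Condition model.

    Reference levels are absorbed into the intercept; every other level
    gets its own 0/1 dummy column.
    """
    patients = sorted({p for _, _, p in samples if p != ref_patient})
    conditions = sorted({c for _, c, _ in samples if c != ref_condition})
    columns = (["Intercept"]
               + ["Patient" + p for p in patients]
               + ["Condition" + c for c in conditions])
    rows = []
    for _, condition, patient in samples:
        row = [1]                                            # intercept
        row += [1 if patient == p else 0 for p in patients]  # patient dummies
        row += [1 if condition == c else 0 for c in conditions]
        rows.append(row)
    return columns, rows


columns, design = build_design_matrix(samples)
print(columns)
for (sample_id, _, _), row in zip(samples, design):
    print(sample_id, row)
```

Under these assumptions each sample becomes one row, and a tumor sample differs from the normal sample of the same patient only in the last column.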
The generalized linear model will fit the data of this study design so that the expression
of each gene will be described as a linear combination of the dummy explanatory variables.
$$y = \beta_0 + \beta_1 \cdot \text{Patient} + \beta_2 \cdot \text{Condition} + \varepsilon \tag{5.21}$$
where $y$ is the response variable that represents the gene expression in a specific unit,
$\beta_0$ is the intercept, or the average gene expression when the explanatory variables
are zero, $\beta_1$ and $\beta_2$ are the regression coefficients that represent the effect
of each explanatory variable, and $\varepsilon$ is the error term. A log-linear model is used as
$$\log \mu_{gi} = X_i^T \beta_g + \log N_i \tag{5.22}$$
where $X_i^T$ is a vector of covariates (explanatory variables) that specifies the
conditions/factors applied to sample $i$ and $\beta_g$ is a vector of regression
coefficients for the gene $g$.
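The log-linear model in Equation 5.22 can be turned around to obtain the expected count of a gene in a sample. The sketch below does this in plain Python. It reads $N_i$ as the library size (total read count) of sample $i$, which is an assumption because the text does not define it at this point. The coefficient values, the covariate layout (intercept, two patient dummies, one condition dummy), and the library size are all hypothetical.

```python
import math


def expected_count(x_i, beta_g, library_size):
    """Expected count mu_gi implied by log(mu_gi) = x_i . beta_g + log(N_i)."""
    linear_predictor = sum(x * b for x, b in zip(x_i, beta_g))
    return library_size * math.exp(linear_predictor)


# Hypothetical coefficients for one gene, in the assumed column order
# [Intercept, PatientRep2, PatientRep3, ConditionTumo].
beta_g = [math.log(1e-5), 0.0, 0.0, math.log(2.0)]

x_normal = [1, 0, 0, 0]      # normal tissue, reference patient
x_tumor = [1, 0, 0, 1]       # tumor tissue, same patient
library_size = 20_000_000    # assumed N_i, identical for both samples

mu_normal = expected_count(x_normal, beta_g, library_size)
mu_tumor = expected_count(x_tumor, beta_g, library_size)
print(mu_normal, mu_tumor, mu_tumor / mu_normal)
```

Because the model is linear on the log scale, the condition coefficient acts multiplicatively: with the hypothetical value log 2, the expected tumor count is about twice the expected normal count, whatever the library size.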
TABLE 5.1 Sample Information or Metadata for the Design Matrix

SampleID     Condition    Patient
norm_rep1    Norm         Rep1
norm_rep2    Norm         Rep2
norm_rep3    Norm         Rep3
tumo_rep1    Tumo         Rep1
tumo_rep2    Tumo         Rep2
tumo_rep3    Tumo         Rep3